## Necessary Packages
library(knitr)
devtools:: install_github("bayesball/CalledStrike")
library(baseballr)
library(tidyverse)
library(RSQLite)
library(colorspace)
library(dplyr)
library(dbplyr)
library(moderndive)
library(broom)
library(ggfortify)
library(viridis)
library(scales)
library(CalledStrike)
library(mgcv)
library(plotly)

Final usable dataset:

Baseballr package by Bill Petti

The package contains play by play data for each year of the MLB season as well as Statcast data going back to 2015. Statcast data is some of our biggest interest in the dataset. Because there are so much data in the set, we have to make separate databases if we want to look at an entire year’s worth of play by play data. So, we have the entire year of 2019 stored in a SQL database connected to R.

Big Idea of Project:

Over time, home runs have dramatically increased. There have been different theories of reasons why, and the most recent theory is known as the “fly ball revolution”. From 2016 to present, the total home run record has been shattered each year (excluding the shortened 2020 COVID season.) The theory from many around the industry is that the access to advanced data around the league has led to many hitters adopting a principle that hitting the ball on the ground is bad, and the best way to optimize performance and value as a hitter is to hit the ball as hard as you possibly can while getting it off the ground (optimal launch angle and exit velocity).

The premise is that even if you are hitting balls hard on the ground, they are most likely leading to outs because major league infielders aren’t going to make mistakes that often. On the other hand, hitting the ball hard in the air gives you a higher percentage of not only getting a hit, but also the most valuable form of a hit: the home run. Another caveat is that teams are also utilizing what’s called the “infield shift” more than ever to try to take away as many outs as possible based on where a batter is most likely to hit the ball. This has made many hitters even more conscious about hitting the ball in the air in order to attempt to “beat the shift”. Two of the main metrics that hitters have often attempted to enhance are launch angle and exit velocity. Launch angle is the degree of which the ball is coming off the bat, and exit velocity is the speed at which the ball is coming off the bat. Ground balls are classified as a launch angle <10, line drives 10-25, and fly balls 25-50.

A goal of our project is to study the “fly ball revolution” and if the increase in home runs is likely correlated with this theory.

Main metrics that we are going to cover in this project, with definitions

Launch Angle: The vertical angle at which the ball leaves a player’s bat after being struck Exit Velocity: The speed of the baseball as it leaves the player’s bat after being struck (in mph) Batting Average: Hits/At bats. The percentage of times a hitter is reaching base via a hit Weighted on base average (wOBA): Measures a player’s overall offensive contributions per plate appearance by taking the type of outcome into account Hard-hit ball: Defined by Statcast as coming off the bat at >= 95 MPH Spin Rate: Total numbers of revolutions a baseball has on its way to home plate after a pitcher releases it Expected Batting Average: A Statcast metric that measures the likelihood that a batted ball will result in a hit

Scraping each week of the 2019 season

## Retrieving Weekly Play-by-play Data
week1_2019 <- scrape_statcast_savant(start_date = "2019-02-28", end_date = "2019-04-07",
                                     player_type = "batter")
week2_2019 <- scrape_statcast_savant(start_date = "2019-04-08", end_date = "2019-04-14",
                                     player_type = "batter")
week3_2019 <- scrape_statcast_savant(start_date = "2019-04-15", end_date = "2019-04-21",
                                     player_type = "batter")
week4_2019 <- scrape_statcast_savant(start_date = "2019-04-22", end_date = "2019-04-28",
                                     player_type = "batter")
week5_2019 <- scrape_statcast_savant(start_date = "2019-04-29", end_date = "2019-05-05",
                                     player_type = "batter")
week6_2019 <- scrape_statcast_savant(start_date = "2019-05-06", end_date = "2019-05-12",
                                     player_type = "batter")
week7_2019 <- scrape_statcast_savant(start_date = "2019-05-13", end_date = "2019-05-19",
                                     player_type = "batter")
week8_2019 <- scrape_statcast_savant(start_date = "2019-05-20", end_date = "2019-05-26",
                                     player_type = "batter")
week9_2019 <- scrape_statcast_savant(start_date = "2019-05-27", end_date = "2019-06-02",
                                     player_type = "batter")
week10_2019 <- scrape_statcast_savant(start_date = "2019-06-03", end_date = "2019-06-09",
                                      player_type = "batter")
week11_2019 <- scrape_statcast_savant(start_date = "2019-06-10", end_date = "2019-06-16",
                                      player_type = "batter")
week12_2019 <- scrape_statcast_savant(start_date = "2019-06-17", end_date = "2019-06-23",
                                      player_type = "batter")
week13_2019 <- scrape_statcast_savant(start_date = "2019-06-24", end_date = "2019-06-30",
                                      player_type = "batter")
week14_2019 <- scrape_statcast_savant(start_date = "2019-07-01", end_date = "2019-07-07",
                                      player_type = "batter")
week15_2019 <- scrape_statcast_savant(start_date = "2019-07-08", end_date = "2019-07-14",
                                      player_type = "batter")
week16_2019 <- scrape_statcast_savant(start_date = "2019-07-15", end_date = "2019-07-21",
                                      player_type = "batter")
week17_2019 <- scrape_statcast_savant(start_date = "2019-07-22", end_date = "2019-07-28",
                                       player_type = "batter")
week18_2019 <- scrape_statcast_savant(start_date = "2019-07-29", end_date = "2019-08-04",
                                      player_type = "batter")
week19_2019 <- scrape_statcast_savant(start_date = "2019-08-05", end_date = "2019-08-11",
                                      player_type = "batter")
week20_2019 <- scrape_statcast_savant(start_date = "2019-08-12", end_date = "2019-08-18",
                                      player_type = "batter")
week21_2019 <- scrape_statcast_savant(start_date = "2019-08-19", end_date = "2019-08-25",
                                      player_type = "batter")
week22_2019 <- scrape_statcast_savant(start_date = "2019-08-26", end_date = "2019-09-02",
                                      player_type = "batter")
week23_2019 <- scrape_statcast_savant(start_date = "2019-09-03", end_date = "2019-09-09",
                                      player_type = "batter")
week24_2019 <- scrape_statcast_savant(start_date = "2019-09-10", end_date = "2019-09-16",
                                      player_type = "batter")
week25_2019 <- scrape_statcast_savant(start_date = "2019-09-17", end_date = "2019-09-23",
                                      player_type = "batter")
week26_2019 <- scrape_statcast_savant(start_date = "2019-09-24", end_date = "2019-09-30",
                                      player_type = "batter")
### Binding rows into single dataframe
Statcast_2019 <- bind_rows(week1_2019, week2_2019, week3_2019, week4_2019, week5_2019, week6_2019,
                           week7_2019, week8_2019, week9_2019, week10_2019, week11_2019, week12_2019,
                           week13_2019, week14_2019, week15_2019, week16_2019, week17_2019, week18_2019,
                           week19_2019, week20_2019, week21_2019, week22_2019, week23_2019, week24_2019,
                           week25_2019, week26_2019)

### Sample to see if it works
Statcast_sample <- Statcast_2019[sample(nrow(Statcast_2019),10),]
Statcast_sample
## # A tibble: 10 × 92
##    pitch_type game_date  release_speed release_pos_x release_pos_z player_name  
##    <chr>      <date>             <dbl>         <dbl>         <dbl> <chr>        
##  1 SI         2019-05-15          91.5         -1.62          5.8  Maldonado, M…
##  2 FF         2019-09-02          95.6         -1.8           6.87 Dyson, Jarrod
##  3 SL         2019-09-21          82           -2.28          5.57 Ward, Taylor 
##  4 SI         2019-09-21          89.7          2.68          6.07 Carpenter, M…
##  5 SL         2019-08-13          76.8          2.82          5.32 Crawford, J.…
##  6 SI         2019-06-15          94.2         -2.43          5.74 Walker, Chri…
##  7 SL         2019-05-20          81.6          3.3           5.84 Acuña Jr., R…
##  8 CH         2019-06-14          87.6         -1.42          5.92 Kipnis, Jason
##  9 FF         2019-06-08          92.4         -2.49          5.74 Smoak, Justin
## 10 SL         2019-08-18          83.3          2.96          5.17 Owings, Chris
## # … with 86 more variables: batter <dbl>, pitcher <dbl>, events <chr>,
## #   description <chr>, spin_dir <lgl>, spin_rate_deprecated <lgl>,
## #   break_angle_deprecated <lgl>, break_length_deprecated <lgl>, zone <dbl>,
## #   des <chr>, game_type <chr>, stand <chr>, p_throws <chr>, home_team <chr>,
## #   away_team <chr>, type <chr>, hit_location <dbl>, bb_type <chr>,
## #   balls <dbl>, strikes <dbl>, game_year <dbl>, pfx_x <dbl>, pfx_z <dbl>,
## #   plate_x <dbl>, plate_z <dbl>, on_3b <dbl>, on_2b <dbl>, on_1b <dbl>, …
write.csv(Statcast_2019, "StatcastHittingData19.csv", row.names = FALSE)

### Storing in SQL database
db <- dbConnect(SQLite(), db_name = "Statcast.sqlite")
dbWriteTable(conn = db, name = "Statcast_Hitting", Statcast_2019, overwrite = TRUE, row.names = FALSE )

### Actual stored object
Statcast_2019 <- dbGetQuery(conn = db, "SELECT * FROM Statcast_Hitting")

EDA - Home runs over time (2015-2019)

library(Lahman)
data("Teams")
HR <- Teams %>% 
  filter(yearID >= 2014, yearID <= 2019) %>%
  group_by(yearID)%>%
  summarize(n_hr = sum(HR, na.rm = TRUE))

## Graph of home runs over time
ggplot(HR, aes(yearID, n_hr))+
  geom_line(color = "black")+
  labs(x = NULL,
       y = "Home Runs", 
       title = "Major Spike in Home Run Totals in MLB",
       subtitle = "From 2015 to 2019",
       caption = "Data: Lahman database")+
  theme_bw() +
  theme(panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold", size = 10),
        plot.subtitle = element_text(hjust = 0.5, face = "bold", size = 8))

Home runs over time have increased dramatically. The years that we are focused on inlcude 2015-2019. Besides the small dip in 2018 of total home runs, the increase has been substantial. We aim to look at factors that might be causing this to happen. Our first prediction is that the implementation of advanced data in baseball has led to teams being able to optimize player performance by hitting the ball harder and farther than ever. This leads us to two of the main variables that we will focus on in our project, launch angle and exit velocity.

EDA - Launch angle vs Exit Velocity

## Batted Ball Type Characteristics Breakdown
kable(Statcast_2019 %>%
   mutate("Batted Ball Type" = fct_recode(bb_type, "Line Drive" = "line_drive", "Fly Ball" = "fly_ball", Popup = "popup", "Ground Ball" = "ground_ball")) %>%
  group_by(`Batted Ball Type`) %>%
  summarize("Average Launch Angle (°)" =  round(mean(launch_angle, na.rm=TRUE),1), "Average Launch Speed (MPH)" = round(mean(launch_speed, na.rm=TRUE), 1), "Total" = n()) %>%
  filter(!is.na(`Batted Ball Type`)) %>%
  arrange(desc(`Average Launch Angle (°)`)))
Batted Ball Type Average Launch Angle (°) Average Launch Speed (MPH) Total
Popup 64.1 77.3 9119
Fly Ball 36.5 92.0 29731
Line Drive 16.6 93.3 31488
Ground Ball -11.7 85.6 54958
## Hit Type Characteristics Breakdown
kable(Statcast_2019 %>%
  mutate("Hit Type" = fct_recode(events, Single = "single", Double = "double", Triple = "triple", "Home Run" = "home_run")) %>%
  filter(events %in% c("single", "double", "triple", "home_run")) %>%
  group_by(`Hit Type`) %>%
  summarize("Average Launch Angle (°)" = round(mean(launch_angle, na.rm=TRUE), 1), "Average Launch Speed (MPH)" = round(mean(launch_speed, na.rm=TRUE), 1), "Total" = n()) %>%
  arrange(desc(`Average Launch Angle (°)`)))
Hit Type Average Launch Angle (°) Average Launch Speed (MPH) Total
Home Run 28.2 103.5 6748
Triple 20.1 97.6 783
Double 16.5 97.3 8509
Single 6.6 90.3 25874
## Batted Ball Type Breakdown of Hits
kable(Statcast_2019 %>%
 mutate(ab = ifelse(type == "X", TRUE, FALSE), hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE), exb_hit = ifelse(events %in% c("double", "triple", "home_run"), TRUE,FALSE), "Batted Ball Type" = fct_recode(bb_type, "Line Drive" = "line_drive", "Fly Ball" = "fly_ball", Popup = "popup", "Ground Ball" = "ground_ball")) %>%
  filter(!is.na(hit)) %>%
  select(`Batted Ball Type`, hit, ab, exb_hit) %>%
  group_by(`Batted Ball Type`) %>%
  summarize("Total Hits" = sum(hit), "Total Extra Base Hits" = sum(exb_hit), "Total Batted Balls" = sum(ab)) %>%
  filter(!is.na(`Batted Ball Type`)) %>%
  arrange(desc(`Total Hits`)) %>%
  mutate("Hits Per Batted Balls (%)" = round((`Total Hits`/`Total Batted Balls`)*100, 1), "Extra Base Hits Per Batted Balls (%)" = round((`Total Extra Base Hits`/`Total Batted Balls`)*100, 1), "Extra Base Hits Per Total Hits (%)" = round((`Total Extra Base Hits`/`Total Hits`)*100, 1)))
Batted Ball Type Total Hits Total Extra Base Hits Total Batted Balls Hits Per Batted Balls (%) Extra Base Hits Per Batted Balls (%) Extra Base Hits Per Total Hits (%)
Line Drive 19699 7201 31487 62.6 22.9 36.6
Ground Ball 13119 1233 54956 23.9 2.2 9.4
Fly Ball 8904 7553 29731 29.9 25.4 84.8
Popup 192 53 9119 2.1 0.6 27.6

Overall Line Drives had the highest overall exit velocity (Launch speed) and were the second highest proportion of batted ball type in 2019

EDA - Launch angle vs exit velocity visualization

### Expected Batting average using launch angle vs exit velocity

parameters <- tibble(launch_angle = c(10, 25, 50),
                     launch_speed = 40,
                     label = c("Ground balls", "Line drives", "Fly balls"))

batting_average <- ggplot(Statcast_2019, aes(x = launch_speed, y = launch_angle, color = estimated_ba_using_speedangle)) +
  geom_point(alpha = .05) +
  geom_hline(data = parameters, aes(yintercept = launch_angle), color = "black", linetype = "dotted") +
  geom_text(data = parameters, aes(label = label, y = launch_angle - 4), color = "black", hjust = "left") +
  scale_color_gradient("xBA", low = "#6ac3cd", high = "#fc766a", label = percent)+
  theme_bw() +
  labs(title = "Launch angle vs. exit velocity in 2019",
       subtitle = "Line drives and fly balls with high exit velocity lead to a higher expected batting average",
       caption = "Data: baseballr | Bill Petti",
       x = "Exit Velocity (mph)",
       y = "Launch Angle (degrees)") +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
        plot.subtitle = element_text(size = 9, hjust = 0.5),
        legend.title = element_text(hjust = 0.5),
        plot.caption.position = "plot",
        panel.grid = element_blank()) +
  scale_y_continuous(breaks = seq(-75, 75, 25))
batting_average

This graph aims to show the relationship between launch angle and exit velocity and the chance that a batter will achieve a hit. The color is mapped by estimated batting average, which is a Statcast tool used to predict a batters chance of getting a hit based on their exit velocity and launch angle. The main takeaway that we see from this graph is that hitting the ball hard at an optimal launch angle gives the batter the best chance at reaching base via a hit. This can be seen in the saturated pockets of the graph with the orange color.

### Expected wOBA using launch angle vs exit velocity

woba <- ggplot(Statcast_2019, aes(x = launch_speed, y = launch_angle, color = estimated_woba_using_speedangle)) +
  geom_point(alpha = .05) +
  geom_hline(data = parameters, aes(yintercept = launch_angle), color = "black", linetype = "dotted") +
  geom_text(data = parameters, aes(label = label, y = launch_angle - 4), color = "black", hjust = "left") +
  scale_color_gradient("xwOBA", low = "#6ac3cd", high = "#fc766a", label=percent) +
  theme_bw() +
  labs(title = "Launch angle vs. exit velocity in 2019",
       subtitle = "Line drives and fly balls with high exit velocity lead to a higher expected wOBA",
       caption = "Data: baseballr | Bill Petti",
       x = "Exit Velocity (mph)",
       y = "Launch Angle (degrees)") +
  theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
        plot.subtitle = element_text(size = 9, hjust = 0.5),
        legend.title = element_text(hjust = 0.5),
        plot.caption.position = "plot",
        panel.grid = element_blank()) +
  scale_y_continuous(breaks = seq(-75, 75, 25))
woba

This is a very similar graph to the one above, but instead of using batting avgerage as the predictor, we are using weighted on base average (wOBA). wOBA is similar to batting average, but instead of assuming that all hits are created equally, wOBA attempts to capture all offensive value by giving different weights to different types of offensive outcomes (i.e. single, double, triple, home run). This is why in this graph there is an even more saturated pocket that can be seen. It makes it more apparent that this pocket will represent the more valuable types of hits such as doubles, triples and home runs. This balls in this pocket are generally hit harder and at a more optimal launch angle than some of the points on the previous graph.

Finding Ideal Launch Angle

## Launch Angle vs. Hit Distance vs. Hit Type for Each Batted Ball Type
Statcast_2019 %>%
 mutate(hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE)) %>%
  filter(hit == TRUE ) %>%
ggplot(aes(launch_angle,hit_distance_sc, color = fct_relevel(events, levels = c("single", "double", "triple", "home_run")))) +
  geom_point(na.rm=TRUE, alpha=0.5) +
  labs(x="Launch Angle", y="Hit distance", title="Ready to Launch", subtitle = "Hit Distance and Type of Hit Balls in regard to Launch Angle") +
  scale_color_discrete_qualitative(palette="Dark 3", name="Hit Type") +
  theme_gray() +
  facet_grid(fct_relevel(bb_type, levels = c("ground_ball", "line_drive", "fly_ball", "popup"))~., shrink=TRUE) + 
  geom_vline(xintercept =c(15,35), color = "black", linetype = "dashed")

## 3D Scatter-plot Incorporating Exit Velocity
Statcast_2019 %>%
  mutate(hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE, FALSE)) %>%
  filter(type == "X", hit == TRUE) %>%
plot_ly(x = ~launch_angle, y = ~launch_speed, z = ~hit_distance_sc, color = ~events, colors = c("#6ac3cd","#fc766a", "green", "purple"), position="jitter") %>% 
  add_markers() %>%
  add_lines(x=15, y=NULL, z=NULL, color=NULL) %>%
  add_lines(x=35, y=NULL, z=NULL, color=NULL) %>%
  layout(scene = list(xaxis = list(title = 'Launch Angle'),
                     yaxis = list(title = 'Launch Speed'),
                     zaxis = list(title = 'Hit Distance',
                                  trace4 = "15°")))

In the first graphic, by breaking down each batted ball into their respective category it allows a better visual interpretation of the proportion of hit types and distance each produce. Further, this displays how the distribution of the launch of different types of batted balls overlap. This being because the flight path of the ball coming off of the bat is impacted by the spin direction of the pitch, thus producing different outcomes. Including all batted ball outcomes the launch angle range with the highest average distance and proportion of hits, as well as extra-base hits, is between 15° and 35°. This range allowing for greatest potential in distance as well as hit type since batted ball outcome is not directly attributable to launch angle, and thus can not be controlled for. Looking now at the 3D Scatter-plot it displays a downward converging trend in both exit velocity and hit distance as launch angle increases past 35°. Here in the graphic trace 4 represents the lower limit of the ideal launch range (15°) and trace 5 the upper (35°). Further, for launch angles less than 15° there is a significantly greater portion or low distance and velocity “single” hit type outcomes, displaying that in this range there is a much higher likelihood of this resulting outcome than within the 15° to 35° range.

## Launch Angle vs. Exit Velocity vs. Hit Distance
paras <- data.frame(label = c("15°","35°", "95 (MPH)"), point = c(20,40,85), y=c(5,5, 100))
Statcast_2019 %>%
  mutate(ila = ifelse(launch_angle^2 >=15^2 & launch_angle^2 <= 35^2, "Ideal", ifelse(launch_angle^2 < 15^2,"Below", "Above"))) %>%
ggplot(aes(launch_angle^2,launch_speed^2, color = hit_distance_sc)) +
  geom_point(na.rm=TRUE, alpha=0.5) +
  geom_vline(xintercept=c(15^2,35^2), color = "blue", linetype = "dashed") +
  geom_hline(yintercept = 95^2, color="blue", linetype="dashed") +
  geom_smooth() + 
  scale_color_continuous_sequential("Sunset") +
  geom_text(data=paras, aes(label = label, x=point^2, y=y^2), color="blue") +
  theme_minimal() +
  labs(x="Launch Angle", y="Exit Velocity (MPH)", title="Liftoff!", subtitle="Relationship between launch angle and speed to hit distance", color="Hit Distance") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

By squaring the x and y axis variable it allow a greater display relationship that each unit increase in these variable interact with each other, and with hit distance. Along with the vertical limit lines displaying our theorized ideal launch angle range there is a horizontal limit displaying the threshold of batted balls deemed “Hard Hit” (>=95MPH launch speed). This displaying that overall this range produces the greatest combination of exit velocity and hit distance than those outside this range.

# Launch Angle Categories Breakdown of Batted Balls
kable(Statcast_2019 %>%
  mutate(ila = ifelse(launch_angle >=15 & launch_angle <= 35, "Ideal (15-35 degrees)", ifelse(launch_angle < 15,"Below 15 degrees", "Above 35 degrees")), hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE), exb_hit = ifelse(events %in% c("double", "triple", "home_run"), TRUE,FALSE), batted = ifelse(type =="X", TRUE, FALSE)) %>%
  select(ila, launch_speed, hit_distance_sc, hit, exb_hit, batted) %>%
  group_by(ila) %>%
  summarize("Average Hit Distance" = mean(hit_distance_sc, na.rm=TRUE), "Average Launch Speed" = mean(launch_speed, na.rm=TRUE), "Total Hits" = sum(hit), "Total Extra Base Hits" = sum(exb_hit), "Total In Play" = sum(batted)) %>%
  arrange(desc(`Average Hit Distance`)) %>%
  mutate("Proportion of Hits to Batted Balls" = `Total Hits`/`Total In Play`, "Proportion of Extra Base Hits to Hits" = `Total Extra Base Hits` / `Total Hits`, "Launch Angle Category" = ila) %>%
  select(`Launch Angle Category`, `Average Hit Distance`, `Average Launch Speed`, `Proportion of Hits to Batted Balls`, `Proportion of Extra Base Hits to Hits`) %>%
  filter(!is.na(`Launch Angle Category`)))
Launch Angle Category Average Hit Distance Average Launch Speed Proportion of Hits to Batted Balls Proportion of Extra Base Hits to Hits
Ideal (15-35 degrees) 276.36408 86.59104 0.5208381 0.6459439
Above 35 degrees 222.85339 79.22390 0.0635940 0.6271871
Below 15 degrees 60.56468 84.82648 0.3339746 0.1493617
##  Launch Angle Categories Breakdown of Hard Hit Batted Balls (Exit Velocity >= 95 MPH)
kable(Statcast_2019 %>%
  filter(launch_speed >= 95) %>%
  mutate(ila = ifelse(launch_angle >=15 & launch_angle <= 35, "Ideal (15-35 degrees)", ifelse(launch_angle < 15,"Below 15 degrees", "Above 35 degrees")), hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE), exb_hit = ifelse(events %in% c("double", "triple", "home_run"), TRUE,FALSE), batted = ifelse(type =="X", TRUE, FALSE)) %>%
  select(ila, launch_speed, hit_distance_sc, hit, exb_hit, batted) %>%
  group_by(ila) %>%
  summarize("Average Hit Distance" = mean(hit_distance_sc, na.rm=TRUE), "Total Hits" = sum(hit), "Total Extra Base Hits" = sum(exb_hit), "Total In Play" = sum(batted)) %>%
  arrange(desc(`Average Hit Distance`)) %>%
  mutate("Proportion of Hits to Batted Balls" = `Total Hits`/`Total In Play`, "Proportion of Extra Base Hits to Hits" = `Total Extra Base Hits` / `Total Hits`, "Launch Angle Category" = ila) %>%
  select(`Launch Angle Category`, `Total Hits`, `Total Extra Base Hits`, `Total In Play`, `Average Hit Distance`, `Proportion of Hits to Batted Balls`, `Proportion of Extra Base Hits to Hits`) %>%
  filter(!is.na(`Launch Angle Category`)))
Launch Angle Category Total Hits Total Extra Base Hits Total In Play Average Hit Distance Proportion of Hits to Batted Balls Proportion of Extra Base Hits to Hits
Ideal (15-35 degrees) 10937 9943 17005 359.87508 0.6431638 0.9091158
Above 35 degrees 662 654 4080 330.67847 0.1622549 0.9879154
Below 15 degrees 12792 2298 24559 96.91968 0.5208681 0.1796435

Batted balls that had launch angles within the 15° to 35° range an average distance 50 feet greater than the next highest category. Further, it possessed the greatest average launch speed as well as a significantly greater proportion of the batted balls resulting in a hit, and of those hits being extra base hits as well. Next isolating batted balls deemed to be hardly hit the ideal range had the highest total resulting in extra base hits, the highest average distance, as well as the highest proportion of balls resulting in hits. These tables displaying how the other ranges outside the ideal still result in a significant amount of hits, as well as hard hit balls but balls hit within the ideal range have the greatest potential for distance as well as becoming a hit.

Hit Distance Prediction Model

## Transforming and Re-leveling variables
s_cr <- Statcast_2019 %>%
  mutate(z_r = ifelse(zone < 7, ifelse(zone < 4, "low", "middle"), "high"), la_sq = launch_angle^2, ila = ifelse(launch_angle >=15 & launch_angle <= 35, "Ideal (15-35 degrees)", ifelse(launch_angle < 15,"Below 15 degrees", "Above 35 degrees")), ila = fct_relevel(ila, levels=c("Ideal (15-35 degrees)", "Below 15 degrees", "Above 35 degrees")), bb_type = fct_relevel(bb_type, levels=c("line_drive", "ground_ball", "fly_ball", "popup")), ls_sqr = sqrt(launch_speed), warm = ifelse(effective_speed >= 90, "Gas", ifelse(effective_speed <= 75, "slow", "Norm")))

## Creating Training and Test sets
split_dummy <- sample(c(rep(0, 0.7 * nrow(s_cr)), rep(1, 0.3 * nrow(s_cr))))
data_train <- s_cr[split_dummy == 0, ]
data_test <- s_cr[split_dummy == 1, ]

The variables that were chosen to be placed within the model are: effective pitch speed, release spin rate, the interaction of effective pitch speed and release spin rate, launch speed, launch angle, the interaction of launch speed angle, batted ball type, and pitch location. In transforming the data to be applied to the model the launch angle was converted to a dummy categorical variable specifying the ideal or above/below ranges, the pitch location variable was divided into three categories of high/middle/low, and the batted ball type was re-leveled with line drive placed first.

## Training Model
my_lm <- lm(hit_distance_sc ~ release_spin_rate*effective_speed + ila + launch_speed + bb_type + launch_speed_angle + z_r, data = data_train)

summary(my_lm)
## 
## Call:
## lm(formula = hit_distance_sc ~ release_spin_rate * effective_speed + 
##     ila + launch_speed + bb_type + launch_speed_angle + z_r, 
##     data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -188.255  -22.168   -3.214   25.271  161.898 
## 
## Coefficients:
##                                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                        1.677e+02  1.454e+01   11.531  < 2e-16 ***
## release_spin_rate                 -1.997e-02  6.273e-03   -3.183 0.001460 ** 
## effective_speed                   -6.384e-01  1.707e-01   -3.739 0.000185 ***
## ilaBelow 15 degrees               -9.999e+01  5.255e-01 -190.285  < 2e-16 ***
## ilaAbove 35 degrees               -2.800e+01  5.739e-01  -48.790  < 2e-16 ***
## launch_speed                       8.675e-01  1.281e-02   67.703  < 2e-16 ***
## bb_typeground_ball                -1.130e+02  5.236e-01 -215.726  < 2e-16 ***
## bb_typefly_ball                    4.724e+01  4.987e-01   94.724  < 2e-16 ***
## bb_typepopup                      -6.317e+01  9.113e-01  -69.322  < 2e-16 ***
## launch_speed_angle                 2.426e+01  1.808e-01  134.183  < 2e-16 ***
## z_rlow                             2.510e+00  4.095e-01    6.130 8.85e-10 ***
## z_rmiddle                          7.984e-01  3.056e-01    2.613 0.008984 ** 
## release_spin_rate:effective_speed  2.384e-04  7.389e-05    3.227 0.001253 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38 on 78456 degrees of freedom
##   (432387 observations deleted due to missingness)
## Multiple R-squared:  0.9251, Adjusted R-squared:  0.9251 
## F-statistic: 8.076e+04 on 12 and 78456 DF,  p-value: < 2.2e-16
kable(data.frame(my_lm$coefficients) %>%
  rownames_to_column(var = "Model Variables"))
Model Variables my_lm.coefficients
(Intercept) 167.6833899
release_spin_rate -0.0199659
effective_speed -0.6384398
ilaBelow 15 degrees -99.9894977
ilaAbove 35 degrees -27.9986276
launch_speed 0.8675357
bb_typeground_ball -112.9609979
bb_typefly_ball 47.2432600
bb_typepopup -63.1743010
launch_speed_angle 24.2560284
z_rlow 2.5101442
z_rmiddle 0.7984426
release_spin_rate:effective_speed 0.0002384
## Test Model
my_lm2 <- lm(hit_distance_sc ~ effective_speed*release_spin_rate + ila + launch_speed + bb_type + launch_speed_angle, data = data_test)

summary(my_lm2)
## 
## Call:
## lm(formula = hit_distance_sc ~ effective_speed * release_spin_rate + 
##     ila + launch_speed + bb_type + launch_speed_angle, data = data_test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -184.011  -22.464   -3.052   25.362  157.349 
## 
## Coefficients:
##                                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                        1.985e+02  2.262e+01    8.773  < 2e-16 ***
## effective_speed                   -1.033e+00  2.658e-01   -3.887 0.000102 ***
## release_spin_rate                 -3.585e-02  9.770e-03   -3.669 0.000243 ***
## ilaBelow 15 degrees               -9.973e+01  8.157e-01 -122.264  < 2e-16 ***
## ilaAbove 35 degrees               -2.841e+01  8.817e-01  -32.227  < 2e-16 ***
## launch_speed                       8.988e-01  1.992e-02   45.132  < 2e-16 ***
## bb_typeground_ball                -1.120e+02  8.106e-01 -138.151  < 2e-16 ***
## bb_typefly_ball                    4.763e+01  7.668e-01   62.111  < 2e-16 ***
## bb_typepopup                      -6.277e+01  1.417e+00  -44.286  < 2e-16 ***
## launch_speed_angle                 2.456e+01  2.782e-01   88.277  < 2e-16 ***
## effective_speed:release_spin_rate  4.210e-04  1.151e-04    3.658 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.37 on 33548 degrees of freedom
##   (185378 observations deleted due to missingness)
## Multiple R-squared:  0.9239, Adjusted R-squared:  0.9239 
## F-statistic: 4.074e+04 on 10 and 33548 DF,  p-value: < 2.2e-16
kable(data.frame(my_lm2$coefficients) %>%
  rownames_to_column(var = "Model Variables"))
Model Variables my_lm2.coefficients
(Intercept) 198.4665681
effective_speed -1.0330593
release_spin_rate -0.0358503
ilaBelow 15 degrees -99.7330183
ilaAbove 35 degrees -28.4133500
launch_speed 0.8988250
bb_typeground_ball -111.9883970
bb_typefly_ball 47.6276931
bb_typepopup -62.7710412
launch_speed_angle 24.5622019
effective_speed:release_spin_rate 0.0004210
# Ranking Predictors
kable(data.frame(anova(my_lm2) %>% mutate(percent_var = `Sum Sq`/sum(`Sum Sq`), mean_percent_var = `Mean Sq`/sum(`Sum Sq`))))
Df Sum.Sq Mean.Sq F.value Pr..F. percent_var mean_percent_var
effective_speed 1 300429.30 3.004293e+05 204.07568 0.0000000 0.0004628 0.0004628
release_spin_rate 1 1092030.27 1.092030e+06 741.79456 0.0000000 0.0016823 0.0016823
ila 2 456135609.84 2.280678e+08 154921.94816 0.0000000 0.7027087 0.3513544
launch_speed 1 59703300.97 5.970330e+07 40555.27128 0.0000000 0.0919771 0.0919771
bb_type 3 70992657.85 2.366422e+07 16074.63603 0.0000000 0.1093691 0.0364564
launch_speed_angle 1 11479193.60 1.147919e+07 7797.58913 0.0000000 0.0176845 0.0176845
effective_speed:release_spin_rate 1 19702.68 1.970268e+04 13.38364 0.0002542 0.0000304 0.0000304
Residuals 33548 49387571.03 1.472147e+03 NA NA 0.0760850 0.0000023
## Plotting the Model's Residuals
autoplot(my_lm2, which=1:3)

Looking at the power of the model it is significantly high with a total of ~92.5% of the variation in hit distance being explained. Further, the standard error of the model being ~38 feet is permissible since there are intangible variables that effect distance that are or can not be included in the model. It appears that in the model the predictors which positively effect distance are: launch speed, interaction of launch speed and angle, interaction of effective pitch speed and release spin rate, pitch location middle/low, batted ball type fly ball. Further, those with negative relationships, or reduce hit distance, were effective pitch speed, release spin rate, launch angle range above 35°/below 15°, and batted ball type pop up/ground ball. This effectively displaying hitting hard (above 95MPH exit velocity) line drives or fly balls that have a launch angle of 15° to 35°, on characteristically fastballs, is the best theoretical approach to hitting according to the model. These batted balls having the highest potential for overall hit distance, which is correlated with a greater proportion hits as well as extra base hits. It does appear that for balls that are within 200 feet the model innacurately predicts them but these balls are exposed to a much higher degree of intangible effect and thus are outliers in regard to the majority of batted balls, thus not displaying true heteroscedasity in the residuals.

## Expected Batting Average by Pitch Location
ehit_plot(Statcast_2019, title="Pitch Location and Expected Batting Average") +
  theme_classic() +
  scale_fill_viridis(option="inferno", name="Expected Batting Average", label=percent)

## Average Launch Angle by Pitch Location
la_plot(Statcast_2019, title ="Pitch Location and Launch Angle") +
  theme_classic() + 
  scale_fill_viridis(option="inferno", name="Projected Launch Angle")

## Home Run Probability by Pitch Location
home_run_plot(Statcast_2019, title ="Pitch Location and Homeruns") +
  theme_classic() +
  scale_fill_viridis(option="inferno", name="Probability of Homeruns", label = percent)

Plotting expected batting average, average launch angle, and home run probability in regard to pitch location in the strike zone better displays the effects of pitch location on hit distance than the model coefficients. Independently launch angle is highest for pitches up and center in the strike zone, expected batting average is highest in the lower heart of the strike zone, and home run probability is highest in the direct center of the strike zone. Thus, the lower bound of the ideal launch angle range begins in the lower heart of the strike zone. Thus, while the average of our ideal range is displayed to result from high pitches in the strike zone to a greater degree those pitches occurring in the middle and lower portions of the strike zone produce a much higher probability of the batted ball being a hit in general as well as a home run. Therefor displaying that while the ideal launch angle is seen more in the higher portion of the strike zone it are the pitches within the middle and lower portion of the zone that provide a better chance of getting a hit and greater hit distance as well.